[6.3.0] Implement failure circuit breaker #18541

Merged
merged 3 commits into from
Jun 1, 2023

Conversation

amishra-u
Contributor

@amishra-u amishra-u commented May 30, 2023

Issue

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

Solution

To address this issue, I implemented a failure circuit breaker, which adds three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.

This implementation adds a failure strategy for the circuit breaker and uses the failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate: Measuring the failure rate also requires the success count. Both the failure count and the success count would need to be an AtomicInteger, since both are modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at very high request rates it might contribute to latency.
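The count-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `FailureCountBreaker`, its `State` enum, and its method names are hypothetical, and the failure window is left out to keep the sketch short. Only failures touch the shared counter; the success path does no atomic work.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical count-based circuit breaker; names are assumptions, not Bazel's API. */
final class FailureCountBreaker {
  enum State { ACCEPT_CALLS, REJECT_CALLS }

  private final int failureThreshold;
  // Single shared counter, incremented concurrently by request threads.
  private final AtomicInteger failures = new AtomicInteger();
  private volatile State state = State.ACCEPT_CALLS;

  FailureCountBreaker(int failureThreshold) {
    this.failureThreshold = failureThreshold;
  }

  State state() {
    return state;
  }

  /** Records one failed call and trips the breaker once the threshold is reached. */
  void recordFailure() {
    if (failures.incrementAndGet() >= failureThreshold) {
      state = State.REJECT_CALLS;
    }
  }

  /** Deliberately a no-op: with a failure count there is no success counter to update. */
  void recordSuccess() {}
}
```

A rate-based breaker would need a second atomic counter updated in `recordSuccess()`, which is the per-request cost the reasoning above tries to avoid.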

Reasoning behind using failure circuit breaker: A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache is disabled for that build. However, it is enabled again for the next build, because a new instance of Retrier.CircuitBreaker is created. If needed, a cool-down strategy (e.g. failure_and_cool_down_strategy) may be added in the future.
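The per-build lifecycle can be illustrated with a small simulation. Everything in it is an assumption made for illustration: `PerBuildBreakerDemo`, `Breaker`, and `runBuild` are hypothetical names, and each "build" simply issues a fixed number of remote cache calls that all fail. Because every build gets a fresh breaker, a trip in one build does not carry over to the next.

```java
/** Hypothetical simulation of a breaker that is recreated for every build. */
final class PerBuildBreakerDemo {

  /** Minimal breaker: trips after `threshold` failures and never resets. */
  static final class Breaker {
    private final int threshold;
    private int failures;

    Breaker(int threshold) {
      this.threshold = threshold;
    }

    synchronized void recordFailure() {
      failures++;
    }

    synchronized boolean isTripped() {
      return failures >= threshold;
    }
  }

  /**
   * Simulates one build in which every remote cache call fails.
   * Returns how many calls were actually attempted before the breaker tripped.
   */
  static int runBuild(int requests, int threshold) {
    Breaker breaker = new Breaker(threshold); // fresh instance per build
    int attempted = 0;
    for (int i = 0; i < requests; i++) {
      if (breaker.isTripped()) {
        continue; // remote cache is skipped for the remainder of this build
      }
      attempted++;
      breaker.recordFailure();
    }
    return attempted;
  }
}
```

Calling `runBuild(10, 3)` twice attempts three remote calls in each build: the second build starts with a clean breaker and tries the remote cache again, which is why no cool-down logic is needed in this design.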

Closes #18359
commit 5575ff2

Copy of bazelbuild#18120: I accidentally closed bazelbuild#18120 during a rebase and don't have permission to reopen it.

closes bazelbuild#18136

Closes bazelbuild#18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704
@google-cla

google-cla bot commented May 30, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@amishra-u amishra-u changed the base branch from master to release-6.3.0 May 30, 2023 22:53
@amishra-u amishra-u marked this pull request as ready for review May 31, 2023 00:05
@Pavank1992 Pavank1992 added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels May 31, 2023
@iancha1992 iancha1992 closed this May 31, 2023
@iancha1992 iancha1992 removed team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels May 31, 2023
@iancha1992 iancha1992 requested review from coeuvre and removed request for iancha1992 May 31, 2023 20:19
@iancha1992 iancha1992 reopened this May 31, 2023
@iancha1992
Member

iancha1992 commented May 31, 2023

@coeuvre could you approve this into release-6.3.0?

@iancha1992 iancha1992 enabled auto-merge (squash) May 31, 2023 20:38
@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 1, 2023
@Franetse7

The configuration should be modified to merge the open-source function code.

@iancha1992 iancha1992 merged commit 31f07cc into bazelbuild:release-6.3.0 Jun 1, 2023
copybara-service bot pushed a commit that referenced this pull request Jul 24, 2023
Baseline:  758b44d

Release Notes:

+ Automatic code cleanup. (#18417)
+ Update CODEOWNERS for 6.3.0 (#18369)
+ Overrides specified by non-root modules no longer cause an error, and are silently ignored instead. They were originally treated as an error to allow for the future possibility of overrides in the transitive dependency graph working together; but we've deemed that infeasible (and even if it was, it'd be so complicated and confusing to users that it would not be a good addition). (#18388)
+ Add implementation deps support for Objective-C (#18372)
+ Update release notes scripts (#18400)
+ Prevent CredentialHelperEnvironment crash when invoking Bazel outside of a workspace. (#18430)
+ Use wall-time for credential helper invalidation (#18413)
+ blaze_util_posix: handle killpg failures (#18403)
+ Pass version to java_runtimes created by local_java_repository (#18415)
+ Add jsonproto option to query --output flag (#18438)
+ Don't eagerly flatten a `NestedSet` in `RepoMappingManifestAction` (#18419)
+ rules_go & rules_python are failing in Downstream CI with Bazel@HEAD (#18447)
+ Move credential helper setup into remote_helpers.sh so it can be reused by other shell tests. (#18453)
+ Wire credential helper to repository fetching. (#18429)
+ Updates/fixes to relnotes script (#18470)
+ Report percentual download progress in repository rules (#18471)
+ Support remote symlink outputs when building without the bytes. (#18476)
+ Enrich local BEP upload errors with file path and digest possible. (#18481)
+ Set `GTEST_SHARD_STATUS_FILE` in test setup (#18482)
+ Fix relnotes script (#18491)
+ Fix Xcode 14.3 compatibility (#18490)
+ Fix #18493. (#18514)
+ Extend the credential helper default timeout to 10s. (#18527)
+ Fix formatting of release notes (#18534)
+ Use extension rather than local names in ModuleExtensionMetadata (#18536)
+ [credentialhelper] Ignore all errors when writing stdin (#18540)
+ Improve error on invalid `-//foo` and `-@repo//foo` options (#18516)
+ Implement failure circuit breaker (#18541)
+ Actually check `TEST_SHARD_STATUS_FILE` has been touched (#18418)
+ Ignore hash string casing (#18414)
+ Error if repository name isn't supplied (#18425)
+ Track repo rule label attributes after the first non-existent one (#18412)
+ Add ServerCapabilities into RemoteExecutionClient (#18442)
+ RemoteExecutionService: support output_symlinks in ActionResult (#18441)
+ RemoteExecutionService: Action.Command to set output_paths (#18440)
+ Use local_termination_grace_seconds when testing LinuxSandbox availability (#18568)
+ Fix dangling string literal in `extension_metadata` docs (#18598)
+ Include actual MODULE.bazel location in stack traces (#18612)
+ Make cpp file extensions case sensitive again (#18552)
+ Fix error when script is run after the final tag is created. (#18638)
+ Fix WORKSPACE toolchain resolution with `--enable_bzlmod` (#18649)
+ Add `ActionExecutionMetadata` as a parameter to `ActionInputPrefetcher#prefetchFiles`. (#18656)
+ Use failure_rate instead of failure count for circuit breaker  (#18559)
+ Update ignored_error logic for circuit_breaker (#18662)
+ Don't rewind the build if invocation id stays the same (#18670)
+ Fix potential memory leak in UI (#18659)
+ Test that a credential helper can supply credentials for bzlmod. (#18663)
+ Add flag --experimental_collect_code_coverage_for_generated_files. (#18664)
+ Options specified on the pseudo-command `common` in `.rc` files are now ignored by commands that do not support them as long as they are valid options for *any* Bazel command. Previously, commands that did not support all options given for `common` would fail to run. These previous semantics of `common` are now available via the new `always` pseudo-command. Closes #18130. (#18609)
+ Fix split post-processing of LLVM-based coverage (#18737)
+ Allow module extension usages to be isolated (#18727)
+ BEGIN_PUBLIC (#18729)
+ Declare credential helpers to be a stable feature. (#18752)
+ Add a new provider for injecting native libs in android_binary (#18753)
+ Properly handle invalid credential files (#18779)
+ The REPO.bazel and MODULE.bazel files are now also considered workspace boundary markers. (#18787)
+ Report remote execution messages as events (#18780)
+ Fail on isolated extension usages without imports (#18793)
+ Add changes to cc_shared_library from head to 6.3 (#18606)
+ Remove option to disable FJP. (#18791)
+ Update to latest turbine version (#18803)
+ None. None (#18808)
+ Wait for outputs downloads before emitting local BEP events that reference these outputs. (#18815)
+ Perform builtins injection for WORKSPACE-loaded bzl files. (#18819)
+ Fix non-declared symlink issue for local actions when BwoB. (#18817)
+ Make grep_includes optional inside cc_common.register_linkstamp_compile_action (#18823)
+ add feature on windows toolchain with right tag (#18654)
+ coverage_common.instrumented_files_info now has a metadata_files argument (#18838)
+ Download directory output for test actions (#18846)
+ Teach DexMapper to not separate synthetic classes from their context … (#18853)
+ **[Incompatible]** query --output=proto --order_output=deps now returns targets in topological order (previously there was no ordering). (#18870)
+ Revert "Don't eagerly flatten a `NestedSet` in `RepoMappingManifestAction` (#18419)" (#18886)
+ Additional source inputs can now be specified for compilation in cc_library targets using the additional_compiler_inputs attribute, and these inputs can be used in the $(location) function. Fixes #18766. (#18882)
+ Open-source Google test `ConvenienceSymlinkTest` (#18890)
+ Update Error Prone to 2.20.0 (#18885)
+ Check if json.gz files exist, not the gcov version. (#18889)
+ Lockfile updates (#18894)
+ handle exception instead of crashing (#18895)
+ Add a new provider for passing dex related artifacts in android_binary (#18899)
+ Prevent most side effects of yanked modules (#18908)
+ Restore the classic desugar tool in the Bazel 6.3.0 branch so that the Bazel Android tools can be built for 6.3.0 without breaking backwards compatibility (#18909)
+ Update java_tools to v12.5 (#18868)
+ Add ActionCacheStatistics to BEP (#18914)
+ Adjust --top_level_targets_for_symlinks (#18916)
+ Track dev/non-dev `use_extension` calls (#18918)
+ Overrides specified by non-root modules no longer cause an error, and are silently ignored instead. They were originally treated as an error to allow for the future possibility of overrides in the transitive dependency graph working together; but we've deemed that infeasible (and even if it was, it'd be so complicated and confusing to users that it would not be a good addition). (#18921)
+ Rollforward of https://github.com/bazelbuild/bazel/commit/482d2be27ab… (#18773)
+ Update Android tools to 0.27.2 for fixes to DexMapper for https://gith... (#18891)
+ Report dev/non-dev deps imported via non-dev/dev usages (#18922)
+ Add reverted 'isolate' changes (#18928)
+ Identify isolated extensions by exported name (#18923)
+ test-setup.sh: Attempt to raise the original signal once more (#18932)
+ Ignore broken classic desugar tests (#18933)
+ Disable UseCorrectAssertInTests by default (#18948)
+ Fix VS 2022 autodetection (#18960)
+ Fix absolute file paths showing up in lockfiles (#18993)
+ Add support for isolated extension usages to the lockfile (#19008)

Acknowledgements:

This release contains contributions from many people at Google, as well as amishra-u, Andreas Herrmann, Andy Hamon, andyrinne12, Benjamin Lee, Benjamin Peterson, Brentley Jones, Chirag Ramani, Christopher Rydell, Daniel Wagner-Hall, Ed Schouten, Fabian Brandstetter, Fabian Meumertzheim, Greg, Ivan Golub, Jon Landis, JY Lin, Kai Zhang, Keith Smiley, kotlaja, lripoche, oquenchil, Pavan Singh, Rasrack, Son Luong Ngoc, Takeo Sawada, Vertexwahn, Xùdōng Yáng, Yannic.
iancha1992 pushed a commit that referenced this pull request Jul 24, 2023
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team
Development

Successfully merging this pull request may close these issues.

bazel run test_target doesn't convey test exit code
5 participants